Shout out to ChatGPT for helping me rewrite this.

https://chat.openai.com/share/adb3e59d-1e70-42d9-a7ae-b2a8f547bd6a

In [ ]:
import librosa
import numpy as np
from IPython.display import Audio
import matplotlib.pyplot as plt
import holoviews as hv
hv.extension('bokeh')
file_names = ['audio/03-01-01-01-01-02-01.wav',
              'audio/20 - 20,000 Hz Audio Sweep Range of Human Hearing.mp3', 
              'audio/videoplayback.mp3']

Audio as Time-Series Data¶

Representation: Time-series data, typically called audio_data, is plotted as a graph where time is on the X-axis and magnitude (or amplitude) is on the Y-axis.
Information Content: This representation does capture all the information about the sound we hear. The fluctuations in amplitude over time represent the sound wave's pressure variations, which correspond to the sound we perceive.
Perception of Sound: It's crucial to distinguish between the physical properties of sound (amplitude, frequency) and how we perceive these properties (loudness, pitch). Amplitude in a time-series graph relates to the loudness of the sound, but not directly to its pitch or quality.

Challenges and Issues¶

Interpretation vs. Intuition: While the graph provides a direct view of sound amplitude over time, understanding the sound's characteristics from this alone can be non-intuitive.
For example:

  • High amplitude followed by a sudden drop could indicate a loud sound abruptly stopping.
  • A gradual increase in amplitude might suggest a sound gradually getting louder. However, without information on frequency, it's hard to determine the nature of the sound (e.g., a musical note rising in pitch).

Perception of Sound and Amplitude: Amplitude corresponds to the loudness of a sound, but our perception of sound is multidimensional, involving pitch (related to frequency), timbre (related to the complex structure of sound waves), and duration. Simply observing amplitude variations does not provide a complete picture of these aspects.

Time Shift Sensitivity: Time-series data is sensitive to time shifts. A slight shift in the time domain can change the appearance of the waveform, potentially leading to different interpretations, especially in complex sounds or music.
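The time-shift point can be shown with a toy example (a pure-NumPy sketch; all names here are my own): a sine tone and a slightly delayed copy sound identical, yet their waveforms disagree sample by sample.

```python
import numpy as np

sr = 22050                                        # sample rate in Hz
t = np.arange(sr) / sr                            # one second of timestamps
tone = np.sin(2 * np.pi * 440 * t)                # 440 Hz sine
shifted = np.sin(2 * np.pi * 440 * (t + 0.001))   # same tone, shifted by 1 ms

# The two waveforms disagree sample by sample...
max_diff = np.max(np.abs(tone - shifted))
# ...yet their magnitude spectra (closer to what we hear) are essentially identical.
spec_diff = np.max(np.abs(np.abs(np.fft.rfft(tone)) - np.abs(np.fft.rfft(shifted))))
```

Here `max_diff` is large (the waveforms look different) while `spec_diff` is near zero (the frequency content is the same).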

In [ ]:
for file_name in file_names:
    audio_data, sample_rate = librosa.load(file_name)

    time = np.arange(len(audio_data)) / sample_rate  # one timestamp per sample
    plot = hv.Curve((time, audio_data)).opts(width=1100, height=400, title="Waveform: " + file_name)

    display(plot)
    display(Audio(data=audio_data, rate=sample_rate, autoplay=False))

A little bit of time and a little bit of frequency¶

Understanding Human Hearing through Frequency Analysis¶

Frequency Analysis: Humans perceive sound not just in terms of loudness (amplitude) but also pitch, timbre, and rhythm. Frequency analysis allows us to break down complex audio signals into their constituent frequencies, helping us understand these aspects better.
Analyzing Characteristics: By examining audio in the frequency domain, we can:

  • Understand tones (individual frequencies).
  • Analyze timbre (the quality or color of sound), which is determined by the combination of frequencies and their amplitudes.
  • Determine pitch (related to the fundamental frequency of the sound).
  • Identify octaves and other musical properties.
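As a minimal illustration of frequency analysis (pure NumPy; all names are mine): the largest peak in the magnitude spectrum of a pure tone recovers its pitch.

```python
import numpy as np

sr = 22050
t = np.arange(sr) / sr                      # one second of samples
tone = np.sin(2 * np.pi * 440 * t)          # A4 at 440 Hz

spectrum = np.abs(np.fft.rfft(tone))
freqs = np.fft.rfftfreq(len(tone), d=1 / sr)
peak_hz = freqs[np.argmax(spectrum)]        # frequency of the strongest bin
```

With a one-second signal the bins are 1 Hz apart, so `peak_hz` lands on 440 Hz.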

Discrete Fourier Transform (DFT)¶

Analysis of Audio Characteristics: In the frequency domain, characteristics like tones, timbre, pitch, and octaves can be analyzed more intuitively than in the time domain, because these characteristics are directly related to the frequencies present in the sound.

Time-Frequency Trade-off: The DFT converts time-domain data (signals as a function of time) into frequency-domain data (signals as a function of frequency). This is a fundamental aspect of Fourier analysis.

Loss of Time Information: When applying the DFT, the time variable $t$ is effectively replaced by the frequency variable. While the DFT provides detailed information about the frequencies present in the entire signal, it does not retain information about when those frequencies occur within the signal's time span. This loss of temporal information can be significant in contexts where the timing of sound events (like changes in music or speech) is crucial.

Limitation of Naive DFT Application: Applying the DFT without considering its limitations leads to a loss of time-related information, which is important in many audio applications.
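A toy demonstration of the timing loss (pure NumPy; names are my own): a signal that plays 440 Hz then 880 Hz has the same magnitude spectrum as one that plays the two tones in the reverse order, because swapping the halves is a circular time shift, which changes only the phase.

```python
import numpy as np

sr = 22050
t = np.arange(sr // 2) / sr                 # half a second per segment
a = np.sin(2 * np.pi * 440 * t)
b = np.sin(2 * np.pi * 880 * t)

low_then_high = np.concatenate([a, b])
high_then_low = np.concatenate([b, a])

mag1 = np.abs(np.fft.rfft(low_then_high))
mag2 = np.abs(np.fft.rfft(high_then_low))
# The two orderings sound different, but the DFT magnitudes match.
rel_err = np.max(np.abs(mag1 - mag2)) / np.max(mag1)
```

The DFT magnitude alone cannot tell you which tone came first.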

Short-Time Fourier Transform (STFT)¶

Balancing Time and Frequency: STFT provides a balance between time and frequency information, allowing for the analysis of how frequencies in a signal change over time.
Dynamic Analysis: It's particularly useful for signals whose frequency components vary over time, like in speech or music.



def STFT(signal, window_size, step_size):
    # Runnable Python sketch of the STFT.
    num_segments = 1 + (len(signal) - window_size) // step_size
    X = np.zeros((num_segments, window_size), dtype=complex)
    window = np.hanning(window_size)  # Hann window tapers the segment edges
    for i in range(num_segments):
        start = i * step_size
        segment = signal[start:start + window_size]
        X[i, :] = np.fft.fft(segment * window)
    return X

The function returns X[n, w], where n selects the window and w selects the frequency bin; in mathematical terms, $X(n, \omega)$.

Challenges¶

While the STFT provides a more nuanced view of a signal by preserving some time information, there are still inherent limitations:

Time-Frequency Resolution Trade-off: The choice of window size in STFT creates a trade-off:

  • A longer window provides better frequency resolution but poorer time resolution.
  • A shorter window gives better time resolution but poorer frequency resolution.
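The trade-off can be quantified with back-of-the-envelope arithmetic (the window sizes here are illustrative): with sample rate sr and window length n, the frequency bins are sr / n apart, while each window spans n / sr seconds.

```python
sr = 22050  # sample rate in Hz (librosa's default)

for n in (256, 4096):
    freq_resolution = sr / n   # spacing between frequency bins, in Hz
    time_span = n / sr         # duration covered by one window, in seconds
    print(f"window={n}: {freq_resolution:.1f} Hz bins, {time_span * 1000:.1f} ms window")
```

The 256-sample window resolves events about 12 ms apart but smears frequencies into roughly 86 Hz bins; the 4096-sample window resolves roughly 5 Hz bins but blurs anything shorter than about 186 ms.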

Windowing Artifacts: The use of window functions can introduce artifacts such as spectral leakage or windowing effects, which might distort the true spectral content of the signal.

Dynamic Changes: For signals with rapidly changing spectral content, even STFT might not be able to capture all the nuances, as it assumes the signal within each window is stationary.

In [ ]:
for file_name in file_names:
    audio_data, sample_rate = librosa.load(file_name)
    
    stft = librosa.stft(audio_data)

    plt.figure(figsize=(16, 5))
    librosa.display.specshow(np.abs(stft), sr=sample_rate, x_axis='time', y_axis='log')
    plt.colorbar()
    plt.title('Magnitude of STFT')
    plt.show()

    display(Audio(data=audio_data, rate=sample_rate, autoplay=True))

Spectrogram¶

  • The spectrogram is the STFT prepared for visualization.
  • It takes the magnitude of the STFT and puts the amplitude on a logarithmic (dB) scale, which can make certain features more prominent or visible.
  • In an ML pipeline, it is hard to say when the spectrogram will outperform the magnitude of the STFT, since they represent the same information on a different scale.
In [ ]:
for file_name in file_names:
    audio_data, sample_rate = librosa.load(file_name)
    
    stft = librosa.stft(audio_data)

    magnitude_stft = np.abs(stft) # discard phase-information
    db = librosa.amplitude_to_db(magnitude_stft, ref=np.max) # reference the peak, so 0 dB corresponds to the loudest point

    # Plot
    plt.figure(figsize=(16, 5))
    librosa.display.specshow(db, sr=sample_rate, x_axis='time', y_axis='log')
    plt.colorbar(format='%+2.0f dB')
    plt.title('Spectrogram')
    plt.show()

    display(Audio(data=audio_data, rate=sample_rate, autoplay=True))

Mel spectrogram¶

  • Mel Spectrogram: A mel spectrogram is a type of spectrogram where the frequency scale is converted to the mel scale. This scale is an approximation of human auditory perception, meaning it more closely represents how humans perceive pitch.
  • Focus on Low Frequencies: The mel scale gives more weight to lower frequencies (which humans are more sensitive to) and less to higher frequencies. This is based on the observation that humans perceive differences in lower frequencies more accurately than in higher frequencies.
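The compression of high frequencies can be sketched with the HTK mel formula, m = 2595 · log10(1 + f / 700) (note that librosa's `hz_to_mel` defaults to the slightly different Slaney variant): the same 100 Hz step covers far fewer mels near 8 kHz than near 100 Hz.

```python
import math

def hz_to_mel(f):
    # HTK-style mel formula; librosa defaults to the Slaney variant.
    return 2595.0 * math.log10(1.0 + f / 700.0)

low_step = hz_to_mel(200) - hz_to_mel(100)     # 100 Hz step near 100 Hz
high_step = hz_to_mel(8100) - hz_to_mel(8000)  # same 100 Hz step near 8 kHz
```

`low_step` is roughly ten times `high_step`: the mel scale spends most of its resolution on the low frequencies humans discriminate best.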
In [ ]:
for file_name in file_names:
    audio_data, sample_rate = librosa.load(file_name)
    
    S = librosa.feature.melspectrogram(y=audio_data, sr=sample_rate)
    log_S = librosa.power_to_db(S, ref=np.max)

    plt.figure(figsize=(16, 5))
    librosa.display.specshow(log_S, sr=sample_rate, x_axis='time', y_axis='mel')
    plt.colorbar(format='%+2.0f dB')
    plt.title('Mel spectrogram')
    plt.tight_layout()
    plt.show()

    display(Audio(data=audio_data, rate=sample_rate, autoplay=True))

Selecting which to use¶

Representation of Information: The STFT, spectrograms, and mel spectrograms all represent the frequency content of audio signals over time, but they differ in their frequency scale representation.

Subjectivity in Preference: While these generalizations hold true in many cases, there's a degree of subjectivity. Some individuals might prefer one representation over another based on the specific task or their familiarity with the data.

Model Specificity: Similarly, some machine learning models may perform better with one type of representation over another, depending on the nature of the task, the architecture of the model, and the characteristics of the dataset.

Dataset Characteristics: The type of audio data (e.g., musical pieces, spoken words, ambient sounds) also plays a role in determining the most effective representation.

Other representations¶

Chroma features (aka Pitch Class Profiles (PCP))¶

src: https://s18798.pcdn.co/jpbello/wp-content/uploads/sites/1691/2018/01/6-tonality.pdf

  • Terminologies:
    • Octave: An octave in music is an interval between one musical pitch and another with double its frequency. This means that when you move up an octave, you are doubling the frequency of the original pitch. For example, if you start with the note A at 440 Hz, the next A in the higher octave would be at 880 Hz.
    • Pitches: Pitches in music refer to specific musical notes. In Western music, there are 12 distinct pitches in each octave, which include notes like A, A#, B, C, C#, D, D#, E, F, F#, G, and G#. Each of these pitches corresponds to a specific frequency.
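In twelve-tone equal temperament, each pitch is 2^(1/12) times the previous one, so the frequency k semitones above a reference can be computed directly (a small sketch; `semitone_to_hz` is my own name):

```python
def semitone_to_hz(k, ref=440.0):
    # k semitones above (positive) or below (negative) the reference pitch.
    return ref * 2.0 ** (k / 12.0)
```

For example, 12 semitones above A4 (440 Hz) gives 880 Hz, the A one octave up.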

The pitch helix models the relationship between pitches an octave apart along two dimensions:
Height: naturally organizes pitches from low to high.
Chroma: represents the inherent circularity of pitch organization.

pitch_helix

!!! When the frequency of a musical note is doubled, it is perceived by our ears as being the same pitch, but in a higher octave.

  • Chroma features are used to represent the presence or absence of these 12 pitch classes in a piece of music, regardless of their octave.

    pitch_helix

https://youtu.be/WAnGsp9wajk?t=58

In [ ]:
for file_name in file_names:
    audio_data, sample_rate = librosa.load(file_name)    
    
    chroma = librosa.feature.chroma_stft(y=audio_data, sr=sample_rate)
    librosa.display.specshow(chroma, x_axis='time')
    plt.show()

    display(Audio(data=audio_data, rate=sample_rate, autoplay=True))

Chroma features are based on Western music theory¶

12 Pitch Classes: Chroma features are rooted in Western music theory, which is built on 12 distinct pitches (or pitch classes) per octave. This makes them a powerful tool for analyzing the harmonic content of music, particularly music played on Western instruments.
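Folding a frequency into one of the 12 pitch classes, regardless of octave, can be sketched as (the function name is mine; A4 = 440 Hz is taken as the reference):

```python
import math

PITCH_CLASSES = ['A', 'A#', 'B', 'C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#']

def pitch_class(freq_hz, ref=440.0):
    # Semitones from A4, rounded to the nearest pitch, folded into one octave.
    semitones = round(12 * math.log2(freq_hz / ref))
    return PITCH_CLASSES[semitones % 12]
```

Both 440 Hz and 880 Hz map to 'A', which is exactly the octave invariance chroma features rely on.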

Cepstral Analysis¶

  • Primarily used as a feature-extraction / preprocessing step in machine learning pipelines.
  • Not typically used for visualization.
  • src: https://medium.com/@abdulsalamelelu/mfcc-the-dummys-guide-fd7fc471db76
  • The key idea behind cepstral analysis is to transform the convolution of signals (common in audio processing) into addition.
  • Assumption: the original signal is formed by convolution (e.g., an excitation convolved with a filter), which is generally valid for speech.
    cepstral_formula
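A minimal real-cepstrum sketch in NumPy, assuming the standard definition (inverse FFT of the log magnitude spectrum); the logarithm is what turns a spectral product into a sum:

```python
import numpy as np

def real_cepstrum(x):
    spectrum = np.fft.fft(x)
    log_mag = np.log(np.abs(spectrum) + 1e-12)  # log: multiplication -> addition
    return np.fft.ifft(log_mag).real            # back to a time-like ("quefrency") axis

cep = real_cepstrum(np.random.randn(512))
```

Because the log-magnitudes of convolved components add, their cepstra add too, which is what makes source and filter separable in this domain.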

Mel-Frequency Cepstral Coefficients (MFCC)¶

src: https://www.researchgate.net/figure/The-main-steps-for-calculating-MFCC_fig5_266895811

MFCCs are extensively used in speech and audio processing tasks.

mfcc steps

  1. Windowing: divide the audio signal into short frames.

windowing

  2. DFT: take the DFT of each window.

  3. Mel frequency warping: the mel scale emphasizes lower frequencies, reflecting the human ear's increased sensitivity to them, and de-emphasizes higher frequencies, which are less critical for understanding human speech.

mel freq wrapping

  4. Logarithm:
  • Converts multiplication into addition.
  • Reduces the effect of large variations in signal strength, which helps in handling them more effectively in the subsequent stages of processing.
  • More closely mimics the human ear's perception of loudness.
  5. DCT: take the Discrete Cosine Transform of the log mel energies and keep the first few coefficients. This is the step that decorrelates the features: neighboring mel bands overlap and are therefore correlated, and the DCT compacts that shared information into a small set of nearly independent coefficients.
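The DCT step can be sketched directly (an orthonormal DCT-II in plain NumPy; in practice one would use `scipy.fftpack.dct` or librosa's built-in `mfcc`):

```python
import numpy as np

def dct2(x):
    # Orthonormal DCT-II of a 1-D vector of log mel energies.
    N = len(x)
    n = np.arange(N)
    basis = np.cos(np.pi / N * (n[None, :] + 0.5) * n[:, None])
    coeffs = basis @ x
    coeffs[0] /= np.sqrt(2)
    return coeffs * np.sqrt(2 / N)
```

An MFCC frame is then the first dozen or so of these coefficients. A quick sanity check: for a constant input, all the energy lands in coefficient 0 and every higher coefficient is zero.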

Reading MFCC plot:

  • x-axis: time
  • y-axis: MFCC components. These are the coefficients obtained from the DCT.
  • brightness: magnitude or value of a particular MFCC component at a particular time.
  • Spectral Features: Each MFCC component captures different characteristics of the sound’s spectrum. Lower-order coefficients often capture more general characteristics, while higher-order coefficients capture finer details.
In [ ]:
for file_name in file_names:
    audio_data, sample_rate = librosa.load(file_name)
    
    n_mfccs = [5, 13, 20, 50]
    mfccs = [librosa.feature.mfcc(y=audio_data, sr=sample_rate, n_mfcc=n_mfcc) for n_mfcc in n_mfccs]

    fig, axs = plt.subplots(2, 2, figsize=(14, 6))
    for ax, mfcc, n_mfcc in zip(axs.ravel(), mfccs, n_mfccs):
        librosa.display.specshow(mfcc, x_axis='time', ax=ax)
        ax.set_title(f'n_mfcc: {n_mfcc}')

    plt.suptitle('mfcc plot with different number of components.')
    plt.tight_layout()
    plt.show()

    display(Audio(data=audio_data, rate=sample_rate, autoplay=True))

Thanks¶